Parallel Processor Configuration Design with Processing/Transmission Costs

Authors

  • Saravut Charcranoon
  • Thomas G. Robertazzi
  • Serge Luryi
Abstract

A computer configuration design problem where the objective is to configure a parallel processor to do processing in a cost-effective manner is examined. The application envisioned is a specialized on-line service that rents time on its machine. The combinatorial optimization problem involved is examined analytically and a heuristic algorithm for its solution is provided. Lessons learned in this work appear in the conclusion.

Index Terms: Cost, divisible load, economics, heuristic algorithm, local search, single-level tree network, star network.

1 INTRODUCTION

Over the past several decades, a great deal of research has been performed on the performance evaluation of computer systems. Today, because of the declining cost of computer hardware and the interest in electronic commerce and on-line services, the economics or cost evaluation of networked computation is as deserving of attention as are performance issues.

In this paper, we examine a computer configuration design problem that has implications for the leasing of networked computer time. We envision a scenario where a specialized on-line service wishes to rent time on a high performance parallel machine to users. The question to be investigated is how the parallel processor configuration should be optimized so that, in some sense, the parallel processor can solve a submitted problem at minimal monetary cost to the service and, by implication, to the user.

A number of deliberate choices were made regarding the features of this problem:

  • single-level tree (star) topology,
  • divisible load,
  • linear costs for communication and computation.

Generally, these choices were made for analytical tractability. The divisible load model in particular has, over the years, seen its tractability proven [1] and models problems involving data parallelism well. In spite of these innocuous choices, the related mathematics is substantive.
A secondary reason for the choices is that they allow a comparison with earlier published work by some of the authors [4], which considered computation costs only, in bus networks.

With these choices we seek, in this paper, to optimize the choice of which of a set of processors to connect to which of the tree network's links. This is a combinatorial optimization problem we call the "processor arrangement" problem. Unfortunately, we have not been successful in devising a simple condition to implement an optimal processor arrangement profile (i.e., pairing of processors to links). Instead, expressions of moderate complexity for determining how to improve a given profile will be presented. A heuristic algorithm based on combinatorial local search principles will also be described.

One more choice regarding the problem discussed in this paper requires some explanation. In this work, there are two objective functions to be optimized: the finish (solution) time and the total processing and transmission cost. It is well known that there are several approaches to solving such multiple objective function optimization problems. The approach taken here is to find the minimal cost processor arrangement profile given that, for any profile, finish time is minimized using the methodology of [1]. That is, for each possible arrangement of processors, load is allocated so that all processors stop computing at the same time instant and finish time is minimized for that specific arrangement profile. While other approaches are certainly possible, we believe that the proposed approach is a natural one for a first study.

The paper is organized as follows: The model and load distribution scheme are presented in Section 2. In Section 3, processor arrangement and monetary cost models are discussed. Adjacent processor swapping is also discussed in this section. Cost efficient processor arrangements and the necessary cost improvement conditions in a single-level tree network are developed in Section 4.
The heuristic cost efficient processor arrangement algorithm and its performance evaluation are discussed in Section 5. Finally, the conclusion and lessons learned appear in Section 6.

2 MODEL, NOTATION, AND LOAD DISTRIBUTION

In this section, some necessary modeling, notation, and load distribution equations and their background are discussed.

2.1 Model Descriptions

A single-level tree network where the root processor is equipped with a front-end processor for communications off-loading is considered. The presence of the front-end processor means that the root can compute and communicate simultaneously. A single-level tree network with $(N + 1)$ processors and $N$ links is shown in Fig. 1. All the processors are connected to the root processor, $p_0$, via communication links. Associated with the links and processors are the linear cost coefficients $c^l_1, c^l_2, \ldots, c^l_N$ and $c^p_0, c^p_1, c^p_2, \ldots, c^p_N$, respectively.

The root processor, assumed to be the only processor at which the load arrives, partitions the total processing load into $(N + 1)$ fractions, keeps its own fraction $\alpha_0$, and distributes the other fractions $\alpha_1, \alpha_2, \ldots, \alpha_N$ to the children processors $p_1, p_2, \ldots, p_N$, respectively, and sequentially. We do not consider strategies of multiinstallment load distribution [1]. Each processor begins computing immediately after receiving its assigned fraction of load and continues without any interruption until all of its assigned load fraction has been processed. It is assumed that, compared to the size of the data, the time to report solutions back to the root is negligible.

Let:

  • $\alpha_i$: The load fraction assigned to the $i$th link-processor pair.
  • $w_i$: The inverse of the computing speed of the $i$th processor.
  • $z_i$: The inverse of the link speed of the $i$th link.
  • $T_{cp}$: Computing intensity constant: The entire load is processed in $w_i T_{cp}$ seconds by the $i$th processor.
  • $T_{cm}$: Communication intensity constant: The entire load can be transmitted in $z_i T_{cm}$ seconds over the $i$th link.
  • $T_f$: The finish time: Time at which the last processor ceases computation.

Then, $\alpha_i w_i T_{cp}$ is the time to process the fraction $\alpha_i$ of the entire load on the $i$th processor. Note that the units of $\alpha_i w_i T_{cp}$ are [load] × [sec/load] × [dimensionless quantity] = [seconds]. Likewise, $\alpha_i z_i T_{cm}$ is the time to transmit the fraction $\alpha_i$ of the entire load over the $i$th link. Note that the units of $\alpha_i z_i T_{cm}$ are [load] × [sec/load] × [dimensionless quantity] = [seconds].

IEEE Transactions on Computers, vol. 49, no. 9, September 2000, p. 987. S. Charcranoon is with Alcatel Corporate Research Center, 1201 E. Campbell Rd., Richardson, TX 75081-1936. E-mail: [email protected]. T.G. Robertazzi and S. Luryi are with the Department of Electrical and Computer Engineering, State University of New York at Stony Brook, Stony Brook, NY 11794. E-mail: {tom, sluryi}@ece.sunysb.edu. Manuscript received 7 Aug. 1998; accepted 3 Apr. 2000. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 112526. 0018-9340/00/$10.00 © 2000 IEEE.

2.2 Optimal Finish Time Load Distribution

An equal division of load among processors does not in general give a minimum processing finish time, even in a homogeneous network. Instead, it is intuitive that, to minimize the processing finish time, the load distribution should be such that all processors finish computing at the same time. Otherwise, the processing finish time could be reduced by transferring some fractions of load from busy processors to idle processors. Formal proofs of this argument in the case of linear, bus, and tree networks appear in [1]. However, under certain sets of network parameters, it is not necessary for all processors to be utilized in order to minimize the processing finish time.
In [1], conditions are found which determine which processors should be used to process the arriving load in the case of a single-level tree network. Still, the processors with nonzero assigned load have to finish computing at the same time. In this paper, it is assumed that all processors in the network are utilized.

2.3 Fundamental Recursive Equations

The timing diagram of the process of load distribution in a single-level tree network is given by Fig. 2. In this figure, each of the $N + 1$ processors has a graph associated with it. Communication to and from each processor appears above the time axis and computation by each processor appears below the time axis. Fig. 2 shows the sequential distribution of load fractions, each processor commencing computation upon receiving its load fraction and all processors stopping at the same time. Again, it is assumed that solutions are small enough in comparison with the data that their transmission time back to the root is negligible.

In [4], simple conditions were found for cost optimization for a single-level tree network when only computation costs were taken into account. One might think that, when communication costs are included, the communication and computation costs on each branch might be collapsed into a single equivalent cost. In fact, though, there are dependencies between the timing of events that make the actual situation more complex. For instance, the second link cannot receive load until the transport of load over the first link is complete. However, all of these dependencies can be taken into account through a series of chained linear equations.

As discussed in Section 2.2, since all processors must stop computation at the same instant in order to achieve a minimum finish time, one can set up a series of chained linear equations reflecting this and all other timing relationships.
These equations can be solved for the fractions of load, the $\alpha_i$'s, to be assigned to the processors:

$$\alpha_i w_i T_{cp} = \alpha_{i+1} z_{i+1} T_{cm} + \alpha_{i+1} w_{i+1} T_{cp}, \qquad i = 0, \ldots, N-1. \tag{1}$$

However, rather than solving a set of linear equations, it is simpler to chain the equations together recursively to yield:

$$\alpha_{i+1} = k_i \alpha_i = \left( \prod_{j=0}^{i} k_j \right) \alpha_0, \qquad i = 0, \ldots, N-1, \tag{2}$$
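The recursion in (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code. It makes three assumptions that the excerpt does not state explicitly: the ratio $k_i = w_i T_{cp} / (z_{i+1} T_{cm} + w_{i+1} T_{cp})$ is obtained here simply by dividing (1) through, the fractions are normalized so that they sum to one (they are fractions of the total load), and the function names `load_fractions` and `finish_times` are hypothetical.

```python
from math import isclose


def load_fractions(w, z, t_cp=1.0, t_cm=1.0):
    """Return [alpha_0, ..., alpha_N] satisfying Eq. (1), normalized to sum to 1.

    w: inverse computing speeds w_0..w_N (root processor first).
    z: inverse link speeds z_1..z_N, stored zero-indexed, so z[i] holds z_{i+1}.
    """
    n = len(z)
    if len(w) != n + 1:
        raise ValueError("need one more processor than links")
    ratios = [1.0]  # ratios[i] = alpha_i / alpha_0, built up as in Eq. (2)
    for i in range(n):
        # Assumed form of k_i, read off from Eq. (1).
        k_i = (w[i] * t_cp) / (z[i] * t_cm + w[i + 1] * t_cp)
        ratios.append(ratios[-1] * k_i)
    alpha_0 = 1.0 / sum(ratios)  # normalization: the fractions sum to 1
    return [alpha_0 * r for r in ratios]


def finish_times(alpha, w, z, t_cp=1.0, t_cm=1.0):
    """Finish time of each processor under sequential load distribution.

    The root computes from time 0 (it has a front end); child i starts
    computing once links 1..i have finished transmitting, one after another.
    """
    times = [alpha[0] * w[0] * t_cp]
    link_busy_until = 0.0
    for i in range(1, len(alpha)):
        link_busy_until += alpha[i] * z[i - 1] * t_cm
        times.append(link_busy_until + alpha[i] * w[i] * t_cp)
    return times


if __name__ == "__main__":
    # Homogeneous example: three processors, two links, all parameters 1.
    w, z = [1.0, 1.0, 1.0], [1.0, 1.0]
    alpha = load_fractions(w, z)
    print("fractions:", alpha)
    print("finish times:", finish_times(alpha, w, z))
    print("equal split finish times:", finish_times([1 / 3] * 3, w, z))
```

For the homogeneous example, the fractions come out as 4/7, 2/7, and 1/7, and every processor finishes at time 4/7. An equal split of 1/3 each leaves the last child finishing at time 1, which illustrates the remark in Section 2.2 that equal division is not optimal even in a homogeneous network.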

Journal title:
  • IEEE Trans. Computers

Volume 49  Issue 

Pages  -

Publication date 2000